feat(storage): shard repo-axis GSI1 partition (#23) by pureliture · Pull Request #34 · source-security-dev/security-scanner

pureliture · 2026-06-17T01:34:55Z

What

Shard the per-repo REPO#<repo> GSI1 partition into
REPO#<repo>#SHARD#00..15 (repoAxisVersion=2, repoAxisShardCount=16) so
per-repo entity rows no longer collide on a single hot partition, while
repo-local reads stay logically identical.

Why

The single REPO#<repo> GSI1 partition is a hot-partition risk at cloud scale
(#23): every per-repo entity (FINDING, FINDING_OBSERVATION, FINDING_STATE,
STATE_EVENT, SCAN_LEDGER, GHAS_ALERT) writes to one partition key per repo.

Changes

repo_axis.py: RepoAxisKey + repo_axis_inputs — single source of truth for
the per-entity sharded key formula, shared by write mappers and backfill so the
two cannot drift. Sole owner of the REPO# / #SHARD# literals.
items.py: all 6 repo-axis mappers build the base item, then merge the sharded
GSI1 projection derived from the item's own fields. STATE_EVENT gsi1sk gains
an eventSeq tie-breaker suffix; the rule-axis GSI2 is unchanged.
repo_axis_reader.py: read_repo_axis scatter-gather fan-out across all shards
with a migration-only include_legacy fallback; fails closed on a shard error,
dedupes by (PK, SK) preferring the higher repoAxisVersion, returns canonical
(gsi1sk, PK, SK) order.
store.py / branch_residual.py: read_observations_for_repo and
residual_for_repo route through the fan-out reader (include_legacy threaded,
default False). Direct primary-key reads (read_finding_state,
has_scan_ledger, read_finding_state_events) are unchanged.
repo_axis_migration.py: in-place conditional backfill (not a row copy) with a
per-entity inventory / backfilled / skipped / failed / remaining report and a
documented compatibility-removal gate.
Tests: new tests/test_repo_axis_sharding.py (helper, fan-out reader,
migration, regression scan); updated store and CLI residual tests. The
regression scan bans any #SHARD# literal, raw gsi1pk=REPO#, and GSI1 reads
on the raw REPO# partition outside the single helper module.

Implements docs/workbench/specs/issue-23-repo-gsi1-sharding/design.md (M1–M5)
plus the post-implementation multi-agent review fixes.

Test

uv run pytest — full suite green (616 passed)
uv run ruff check / uv run ruff format — clean

Checklist

Tests pass locally
Lint/format clean
Behavior-preserving for existing reads (primary-key paths unchanged)
Cloud rollout / canary / backfill execution — out of scope (design gate only)

Closes #23

gemini-code-assist

Code Review

This pull request implements GSI1 sharding for the REPO# partition to mitigate hot partition risks under Issue #23. It introduces a deterministic 16-shard key generator, a scatter-gather reader to merge sharded results, and an in-place backfill migration utility, while updating write mappers and repository stores to support the sharded schema. The review feedback highlights two key improvement opportunities: first, a potential NameError on Python versions prior to 3.11 in transport.py due to referencing DynamoDbCompatibleConfig inside its own class without future annotations; second, a performance issue in repo_axis_migration.py where redundant full table scans are performed to compute remaining legacy items, which can be optimized by setting remaining = failed.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-17T01:37:18Z


    @classmethod
-    def from_env(cls) -> "DynamoDbCompatibleConfig":
+    def from_env(cls) -> DynamoDbCompatibleConfig:


Since from __future__ import annotations is not imported at the top of this file, referencing DynamoDbCompatibleConfig as a return type annotation inside its own class definition will raise a NameError at import time on Python versions prior to 3.11.

To prevent runtime import errors on older Python versions, please revert this annotation to a string literal "DynamoDbCompatibleConfig", or add from __future__ import annotations at the very top of the file.

Suggested change

def from_env(cls) -> DynamoDbCompatibleConfig:

def from_env(cls) -> "DynamoDbCompatibleConfig":

gemini-code-assist · 2026-06-17T01:37:18Z

+                skipped += 1
+            else:
+                failed += 1
+        remaining = len(_scan_legacy(table, entity_type))


In DynamoDB, a FilterExpression is applied after the scan reads the data from disk but before returning it to the client. This means a full table scan always consumes the same amount of Read Capacity Units (RCUs) regardless of the filter.

Calling _scan_legacy(table, entity_type) again here to compute remaining triggers an additional full table scan for each entity type (6 extra scans in total). Since any item that was successfully backfilled or skipped is no longer legacy, the number of remaining legacy items is exactly equal to the number of failed items. You can directly assign remaining = failed to completely eliminate these 6 expensive scans.

Suggested change

remaining = len(_scan_legacy(table, entity_type))

remaining = failed

Spread per-repo rows across REPO#<repo>#SHARD#00..15 (repoAxisVersion=2, shardCount=16) to remove the single hot REPO#<repo> GSI1 partition. - RepoAxisKey + repo_axis_inputs: single source of truth for the per-entity sharded key formula; write mappers and backfill share it (no drift). - read_repo_axis: scatter-gather fan-out with migration-only include_legacy fallback, fail-closed on shard error, dedupe by (PK,SK) preferring higher repoAxisVersion, canonical (gsi1sk,PK,SK) order. - read_observations_for_repo / residual_for_repo route through the fan-out reader; direct primary-key reads unchanged. - repo_axis_migration: in-place conditional backfill (not row copy) with a per-entity inventory/backfilled/skipped/failed/remaining report and a documented compatibility-removal gate. - regression scan: single allowed construction site; bans any #SHARD# literal, raw gsi1pk=REPO#, and GSI1 reads on the raw REPO# partition. Implements docs/workbench/specs/issue-23-repo-gsi1-sharding/design.md (M1-M5) and the post-implementation multi-agent review fixes. Full suite green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 17, 2026

View reviewed changes

pureliture force-pushed the claude/issue-23-repo-gsi1-sharding branch from 7ec55e7 to de4facc Compare June 18, 2026 16:07

pureliture merged commit 6648d7a into main Jun 18, 2026
9 checks passed

pureliture deleted the claude/issue-23-repo-gsi1-sharding branch June 18, 2026 16:15

This was referenced Jun 18, 2026

feat(cli): add backfill-repo-axis migration command (#23 follow-up) #37

Merged

feat(storage): shard list/index GSI hot partitions (#23 follow-on) #39

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(storage): shard repo-axis GSI1 partition (#23)#34

feat(storage): shard repo-axis GSI1 partition (#23)#34
pureliture merged 1 commit into
mainfrom
claude/issue-23-repo-gsi1-sharding

pureliture commented Jun 17, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	def from_env(cls) -> DynamoDbCompatibleConfig:
	def from_env(cls) -> "DynamoDbCompatibleConfig":

	remaining = len(_scan_legacy(table, entity_type))
	remaining = failed

Conversation

pureliture commented Jun 17, 2026

What

Why

Changes

Test

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant